ROBUST NEAREST-NEIGHBOR METHODS FOR
CLASSIFYING HIGH-DIMENSIONAL DATA 

By Yao-ban Chan and Peter Hall 

University of Melbourne 

We suggest a robust nearest-neighbor approach to classifying 
high-dimensional data. The method enhances sensitivity by employ- 
ing a threshold and truncates to a sequence of zeros and ones in 
order to reduce the deleterious impact of heavy-tailed data. Empiri- 
cal rules are suggested for choosing the threshold. They require the 
bare minimum of data; only one data vector is needed from each 
population. Theoretical and numerical aspects of performance are ex- 
plored, paying particular attention to the impacts of correlation and 
heterogeneity among data components. On the theoretical side, it 
is shown that our truncated, thresholded, nearest-neighbor classifier 
enjoys the same classification boundary as more conventional, non- 
robust approaches, which require finite moments in order to achieve 
good performance. In particular, the greater robustness of our ap- 
proach does not come at the price of reduced effectiveness. Moreover, 
when both training sample sizes equal 1, our new method can have 
performance equal to that of optimal classifiers that require inde- 
pendent and identically distributed data with known marginal dis- 
tributions; yet, our classifier does not itself need conditions of this 
type. 

1. Introduction. In classification problems where sample size is much 
smaller than dimension, nearest-neighbor methods, after truncation to re- 
duce noise, can enjoy particularly good performance. They have the poten- 
tial to be highly adaptive, not least because they do not require explicit 
assumptions about marginal distributions. 

However, in very high-dimensional settings, conventional nearest-neighbor 
methods can be adversely affected by "noise" from vector components that 
do not carry useful information for classification. Moreover, they are not 
robust against outliers. In particular, they can be influenced considerably by
heavy-tailed features of sampling distributions and can fail to give accurate 
classification when marginal distributions do not enjoy finite variance. Their 
sensitivity to correlation among data components, particularly in very high- 
dimensional contexts, is not well understood. And their performance in high- 
dimensional, highly heterogeneous cases, where the tails of distributions can 
vary from very light to very heavy within the same data vector, is largely 
unknown. 

These phenomena occur often in the area of gene microarray analysis. 
Each microarray represents thousands of gene expression levels, but the 
sample size is typically small. Furthermore, the underlying distributions of 
the gene expression levels are generally unknown and are likely to be
heterogeneous, heavy-tailed and significantly dependent upon each other. With
these features, conventional nearest-neighbor methods for analysis are likely 
to be ineffective. 

In this paper, we shall suggest a robust nearest-neighbor classifier, where 
thresholding and truncation to zeros and ones are used to increase perfor- 
mance and, in particular, to remove sensitivity to heavy-tailed behavior. 
Choosing the threshold appropriately is the key to good classification accu- 
racy. Threshold selection must adapt both to distribution type and to the 
ways in which populations differ from one another. We suggest a simple 
and practicable approach to selecting the threshold. Unlike cross-validation, 
our technique gives good performance even when there is only one training 
data-vector from each population. 

We shall use theoretical arguments and numerical simulation to show that 
our technique is relatively insensitive to dependence among vector compo- 
nents, and that it enjoys good classification accuracy in high-dimensional, 
highly heterogeneous cases. In settings such as these, the performance of 
truncated nearest-neighbor classifiers can surpass that of competitors, such 
as methods based on extrema or on false-discovery rate (FDR) ideas. The 
latter two approaches are often identical; see Jin [14] and Donoho and Jin [5]. 

Nearest-neighbor methods are popular because of the wide variety of data 
types for which they are appropriate. Their implementation requires only 
a measure of distance and, in particular, is not founded on distributional 
properties of the data. Therefore, nearest-neighbor classifiers enjoy a high 
degree of acceptance in settings involving complex data, for example, in 
pattern recognition. See Dasarathy [3] and Shakhnarovich, Darrell and Indyk 
[17], for instance. 

Properties of nearest-neighbor classifiers in classical settings, where di- 
mension is small relative to sample size, are quite well understood. See, in 
particular, Devroye, Gyorfi and Lugosi [4]. Chapter 5 of that monograph is 
an excellent guide to the literature. There is a very large number of papers 
on nearest-neighbor methods in other settings, and it is possible to mention 
only a few of them here. Early contributions include those of Cover and Hart
[2] and Cover [1], who gave upper bounds to risk, and Wagner [18] and Fritz 
[7], who derived convergence properties of the error rate. Psaltis, Snapp and 
Venkatesh [16] extended Cover's [1] results to higher dimensional settings, 
but still with the number of dimensions much less than sample size. Kulkarni 
and Posner [15] and Holst and Irle [10] discussed the case of dependent data
vectors. 

In these relatively classical treatments, it is common to regard the order, 
k, of a nearest-neighbor classifier as a tuning parameter and, perhaps, to 
attempt to optimize over it. However, in a variety of contemporary appli- 
cations the number of data in each sample is so small, especially relative 
to dimension, that there is little point in taking k larger than 1. We argue 
that, in such cases, the information that is critical to good performance is 
accumulated not through the number of data, but through the many compo- 
nents of each data vector. With that in mind, in this paper we shall optimize 
performance in a way that is sensitive to dimension, rather than to sample 
size. 

2. Methodology. 

2.1. Sparsity and truncation. Assume we observe random $p$-vectors
$X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$, drawn from the $X$- and $Y$-populations,
respectively. We wish to construct a classifier, for the purpose of ascribing
a new $p$-vector, $Z$ say, to either population.

Suppose it is known that the respective components of the $X$ and $Y$
distributions are similar, except that one of them has, for a potentially
sparsely arrayed sequence of component indices, generally higher mean than the
other. We can formalize at least part of this assumption, by asking that,
if $X = (X^{(1)}, \ldots, X^{(p)})^T$ and $Y = (Y^{(1)}, \ldots, Y^{(p)})^T$, then,

(2.1) for each $k$, (a) $X^{(k)} - E(X^{(k)})$ and $Y^{(k)} - E(Y^{(k)})$ have similar
distributions, and (b) $E(Y^{(k)}) \ge E(X^{(k)})$; and, for a potentially sparsely
distributed sequence of indices $k$, (c) $E(Y^{(k)}) > E(X^{(k)})$.

The one-sided nature of parts (b) and (c) of (2.1) motivates a one-sided 
classifier. Alternative methodology and theory, very similar to that which 
we shall develop below, are available in the two-sided case. 

In view of the possible sparsity, it seems reasonable to truncate compo- 
nents of the data vectors by deleting those that do not attain a threshold, 
t say. This has the effect of reducing the amount of noise that is present in 
coordinate values that convey little or no information for classification. 

There are a variety of ways of implementing a procedure such as this.
For example, we may do it by replacing each component by 0 if it is less
than $t$, and by 1 otherwise. That is, defining $I_i^{(k)} = I(X_i^{(k)} > t)$,
$J_i^{(k)} = I(Y_i^{(k)} > t)$, $I_i = (I_i^{(1)}, \ldots, I_i^{(p)})^T$ and
$J_i = (J_i^{(1)}, \ldots, J_i^{(p)})^T$, we may build the classifier using the
indicator data vectors $I_1, \ldots, I_m, J_1, \ldots, J_n$. Alternatively, we could
base the classifier on $X'_1, \ldots, X'_m, Y'_1, \ldots, Y'_n$, where
$X_i'^{(k)} = X_i^{(k)} I(X_i^{(k)} > t)$, $Y_i'^{(k)} = Y_i^{(k)} I(Y_i^{(k)} > t)$,
$X'_i = (X_i'^{(1)}, \ldots, X_i'^{(p)})^T$ and $Y'_i = (Y_i'^{(1)}, \ldots, Y_i'^{(p)})^T$.

Using the indicator data, we could conclude that $Z$ came from the $X$
population if

(2.2) $\min_{1 \le i \le m} \sum_{k=1}^{p} (I_i^{(k)} - K^{(k)})^2 < \min_{1 \le i \le n} \sum_{k=1}^{p} (J_i^{(k)} - K^{(k)})^2$,

where $K^{(k)} = I(Z^{(k)} > t)$; and that $Z$ was from the $Y$ population otherwise.
Alternatively, in place of (2.2) we could use the criterion

(2.3) $\min_{1 \le i \le m} \sum_{k=1}^{p} (X_i'^{(k)} - Z'^{(k)})^2 < \min_{1 \le i \le n} \sum_{k=1}^{p} (Y_i'^{(k)} - Z'^{(k)})^2$,

where $Z'^{(k)} = Z^{(k)} I(Z^{(k)} > t)$. In this case, if (2.3) were true, then we would
conclude that $Z$ was from the $X$ population. However, relative to methods
based on (2.2), this approach would suffer more from stochastic variability
and, hence, be less robust, in cases where $X$ and $Y$ had heavy-tailed
distributions.
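
The two rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names, the use of NumPy, and the convention that ties go to the $Y$ population (following the strict inequality as printed in (2.2) and (2.3)) are assumptions made here.

```python
import numpy as np

def indicator_nn_classify(X, Y, z, t):
    """Rule (2.2): nearest neighbor on zero-one data I(value > t)."""
    I = (np.atleast_2d(X) > t).astype(int)   # indicator vectors I_1, ..., I_m
    J = (np.atleast_2d(Y) > t).astype(int)   # indicator vectors J_1, ..., J_n
    K = (np.asarray(z) > t).astype(int)      # indicator vector of the new point Z
    d_x = ((I - K) ** 2).sum(axis=1).min()   # left-hand side of (2.2)
    d_y = ((J - K) ** 2).sum(axis=1).min()   # right-hand side of (2.2)
    return "X" if d_x < d_y else "Y"

def truncated_nn_classify(X, Y, z, t):
    """Rule (2.3): nearest neighbor on values truncated to 0 below t."""
    X, Y, z = (np.asarray(a, dtype=float) for a in (X, Y, z))
    Xt = np.where(np.atleast_2d(X) > t, np.atleast_2d(X), 0.0)
    Yt = np.where(np.atleast_2d(Y) > t, np.atleast_2d(Y), 0.0)
    zt = np.where(z > t, z, 0.0)
    d_x = ((Xt - zt) ** 2).sum(axis=1).min()
    d_y = ((Yt - zt) ** 2).sum(axis=1).min()
    return "X" if d_x < d_y else "Y"
```

In a toy case with a single extreme value in the $X$ training vector, for example `X = [[1000, 3, 0, 0]]`, `Y = [[0, 0, 3, 3]]`, `z = [0, 2.5, 0, 0]` and `t = 1`, the indicator rule still assigns `z` to $X$, whereas the truncated rule is dominated by the outlier and assigns it to $Y$. This is the kind of sensitivity to heavy tails that the text attributes to (2.3).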

2.2. Empirical choice of $t$. We suggest a method based on thresholding,
as follows. Let $i_X$ and $i_Y$ denote the respective values of $i$ at which the
minima on the left- and right-hand sides of (2.2) are achieved. In this notation,
(2.2) is equivalent to $T < 0$, where

(2.4) $T = T(t) = \sum_{k=1}^{p} (I_{i_X}^{(k)} - J_{i_Y}^{(k)})(1 - 2K^{(k)})$.
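
The equivalence between (2.2) and the sign of $T$ rests on the data being zero-one valued. A short check, written in the notation of this section (the symbols $a$, $b$, $c$ are placeholders introduced here for a single coordinate of $I_{i_X}$, $J_{i_Y}$ and $K$):

```latex
% For a, b, c in {0, 1} we have a^2 = a and b^2 = b, so
\begin{align*}
(a - c)^2 - (b - c)^2
  &= a^2 - b^2 - 2c(a - b) \\
  &= (a - b) - 2c(a - b) \\
  &= (a - b)(1 - 2c).
\end{align*}
% Summing over k with a = I_{i_X}^{(k)}, b = J_{i_Y}^{(k)}, c = K^{(k)}:
\[
T(t) = \sum_{k=1}^{p} \bigl(I_{i_X}^{(k)} - K^{(k)}\bigr)^2
     - \sum_{k=1}^{p} \bigl(J_{i_Y}^{(k)} - K^{(k)}\bigr)^2 ,
\]
% i.e. T is the left-hand side of (2.2) minus its right-hand side.
```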

Let $\xi_p$ denote a sequence diverging to infinity; put

(2.5) $z_p = \xi_p \log p$,

denoting a threshold; let

(2.6) $S^2 = S(t)^2 = \sum_{k=1}^{p} (I_{i_X}^{(k)} + J_{i_Y}^{(k)})$

and define $t = \hat t$ by:

(2.7) $\hat t$ is the infimum of values $t > 0$ such that $|T(t)|/S(t) > z_p$; or, if
no such $t$ exists, take $\hat t$ to be a default value, for example, $\hat t = 0$ or
$\hat t = -\infty$.
In the case of independent components, it is feasible to use a smaller
threshold, defining $z_p$ by

(2.8) $z_p = \xi_p (\log p)^{1/2}$,

where $\xi_p \to \infty$, instead of by (2.5). Nevertheless, (2.5) is also appropriate in
the case of independence. If, in (2.7), it were necessary to pass to the default
value, then we would conclude that the classification problem was marginal.
That is, there was insufficient information to solve the problem reliably.
With $t = \hat t$ given by (2.7), the classifier suggested by (2.2) is as follows:

(2.9) Classify $Z$ as coming from the $X$ population if $T(\hat t) < 0$, and as
coming from the $Y$ population otherwise.
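
The selection rule (2.7) and the classifier (2.9) can be sketched as below. This is a hedged illustration rather than the authors' implementation. Assumptions made here: the infimum is found by scanning the observed data values (the indicators, and hence $T$ and $S$, only change at those values); thresholds at which $S(t) = 0$ are skipped; `xi_p`, `t_min` and `default` are user-supplied, hypothetical parameter names; and ties in the nearest-neighbor search are broken by taking the first minimizer.

```python
import numpy as np

def t_and_s(X, Y, z, t):
    """Return T(t) from (2.4) and S(t) from (2.6) at threshold t."""
    I = (np.atleast_2d(X) > t).astype(int)
    J = (np.atleast_2d(Y) > t).astype(int)
    K = (np.asarray(z) > t).astype(int)
    i_x = np.argmin(((I - K) ** 2).sum(axis=1))  # index i_X of nearest X vector
    i_y = np.argmin(((J - K) ** 2).sum(axis=1))  # index i_Y of nearest Y vector
    T = int(((I[i_x] - J[i_y]) * (1 - 2 * K)).sum())
    S = float(np.sqrt((I[i_x] + J[i_y]).sum()))
    return T, S

def choose_threshold(X, Y, z, xi_p, independent=False, t_min=0.0, default=0.0):
    """Smallest candidate t with |T(t)|/S(t) > z_p; else the default value."""
    p = np.asarray(z).size
    # (2.8) for independent components, (2.5) otherwise.
    z_p = xi_p * (np.sqrt(np.log(p)) if independent else np.log(p))
    values = np.unique(np.concatenate([np.ravel(X), np.ravel(Y), np.ravel(z)]))
    candidates = np.concatenate([[t_min], values[values > t_min]])
    for t in candidates:
        T, S = t_and_s(X, Y, z, t)
        if S > 0 and abs(T) / S > z_p:
            return float(t), True
    return float(default), False   # marginal problem: no threshold qualifies

def classify(X, Y, z, xi_p, **kwargs):
    """Rule (2.9): 'X' if T(t_hat) < 0, 'Y' otherwise."""
    t_hat, _ = choose_threshold(X, Y, z, xi_p, **kwargs)
    T, _ = t_and_s(X, Y, z, t_hat)
    return "X" if T < 0 else "Y"
```

The second return value of `choose_threshold` flags whether the default had to be used, which corresponds to the "marginal" case described above.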

Our theoretical justification for (2.5) will be based on the assumption
that the components $X^{(k)}$ and $Y^{(k)}$ are produced by a generalized form of
an infinite-order moving average. The generalization permits marginal
distributions to vary extensively from one component to the next, so that they
are heavy-tailed for some indices $k$ but light-tailed for others. Alternative
models for weak dependence, for example based on autoregressive processes,
can be shown to also lead to the threshold choice at (2.5).

From at least a theoretical viewpoint, exact choice of $\xi_p$ is largely
unimportant. Any sequence, for example, $\xi_p = \log p$, which diverges more slowly
than any polynomial is appropriate. In this way, the sensitivity of the
tuning-parameter selection problem is greatly reduced; we pass from the parameter
$t$, to which the classifier is very sensitive, to $\xi_p$, to which the classifier is
largely insensitive. Practical, empirical choices of $\xi_p$ will be discussed in
Section 4.

Motivation for a threshold-based approach to choosing $t$ can be provided
as follows. Neglect, for the moment, the fact that $i_X$ and $i_Y$ at (2.4)
are random variables; take the components $I_i^{(k)}$ and $J_i^{(k)}$ to be completely
independent, for each $k$ and $i$; and condition on the new data vector $Z$.
Then the random variable $T$, at (2.4), is seen to have variance equal to

(2.10) $\sum_{k=1}^{p} \mathrm{var}(I_{i_X}^{(k)} - J_{i_Y}^{(k)})(1 - 2K^{(k)})^2 = \sum_{k=1}^{p} \{\mathrm{var}(I_{i_X}^{(k)}) + \mathrm{var}(J_{i_Y}^{(k)})\}$,

where the identity follows from the independence assumed earlier in this
paragraph and from the fact that $1 - 2K^{(k)} = \pm 1$. Under the assumptions,
$\mathrm{var}(I_i^{(k)}) = (E I_i^{(k)})(1 - E I_i^{(k)}) \le E(I_i^{(k)})$, with an analogous result
holding for $\mathrm{var}(J_i^{(k)})$. Therefore, $S^2$, at (2.6), tends to overestimate the
right-hand side of (2.10):

$E(S^2) \ge \sum_{k=1}^{p} \{\mathrm{var}(I_{i_X}^{(k)}) + \mathrm{var}(J_{i_Y}^{(k)})\}$.

This slight conservatism, and the log factor in the threshold $z_p$, provide
opportunities for repairing errors that arise from failure of the independence
assumption.

In the argument above, we defined T to be the difference between the two 
sides of (2.2), rather than between the two sides of (2.3). Indeed, it can be 
awkward to estimate the variance of T if we use (2.3) and do not have a good 
model for the distributions of $X^{(k)}$ and $Y^{(k)}$. There are ways of overcoming
this difficulty, but we do not find them as attractive as working with (2.2). 

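An alternative approach to selecting $t$ could be based on standard
cross-validation, taking $\hat t$ to be the infimum of values $t$ that minimize the
error-rate estimator,

$CV(t) = \frac{1}{m} \sum_{i=1}^{m} I\Big( \min_{j : j \ne i} \|X'_i - X'_j\| > \min_{1 \le j \le n} \|Y'_j - X'_i\| \Big)$
$\qquad\quad + \frac{1}{n} \sum_{i=1}^{n} I\Big( \min_{j : j \ne i} \|Y'_i - Y'_j\| > \min_{1 \le j \le m} \|X'_j - Y'_i\| \Big)$.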
However, this technique has the disadvantage that it works only when m 
and n both exceed 1. Moreover, in most problems there is a continuum of 
values of t that minimize CV(t), and so cross-validation does not give an 
explicit answer to the tuning-parameter choice problem. 
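
A leave-one-out estimator of this kind can be sketched as follows. The sketch assumes that the distances are Euclidean and are computed on the truncated vectors $X'_i$ and $Y'_i$ of Section 2.1; the function name `cv_error` is hypothetical. It raises an error when either sample has fewer than two vectors, which mirrors the limitation noted above.

```python
import numpy as np

def cv_error(X, Y, t):
    """Leave-one-out error-rate estimate CV(t) on data truncated at t."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if len(X) < 2 or len(Y) < 2:
        raise ValueError("cross-validation needs m >= 2 and n >= 2")
    Xt = np.where(X > t, X, 0.0)   # truncated vectors X'_i
    Yt = np.where(Y > t, Y, 0.0)   # truncated vectors Y'_i

    def loo_rate(A, B):
        # Fraction of rows of A that are closer to some row of B
        # than to every other row of A.
        errors = 0
        for i in range(len(A)):
            others = np.delete(A, i, axis=0)
            d_same = np.linalg.norm(others - A[i], axis=1).min()
            d_other = np.linalg.norm(B - A[i], axis=1).min()
            errors += int(d_same > d_other)
        return errors / len(A)

    return loo_rate(Xt, Yt) + loo_rate(Yt, Xt)
```

Because `cv_error` is piecewise constant in `t`, a whole interval of thresholds typically attains the minimum, which illustrates the non-uniqueness mentioned in the text.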

2.3. Example of mixed light- and heavy-tailed components. When both 
light- and heavy-tailed data components are present in each data vector, and 
only a very small proportion of the components differ through perturbations, 
it can be particularly difficult to achieve good classification using standard 
distance-based methods, such as support vector machines. In the case of 
these approaches, the accumulation of noise from irrelevant components can 
drown out the signal in those few components that convey information for 
classification. Methods such as FDR, based on extrema, can bring substan- 
tial improvements in performance. However, when data distributions are 
heterogeneous, those techniques too can have difficulty. 

To illustrate this point, assume for the sake of simplicity that all vector
components are mutually independent. Suppose that $X$ consists of just $p^{1-\beta}$
components with standard normal distributions, where $\beta \in (0, 1)$, and
$p - p^{1-\beta}$ components having exponential distributions, for which
$P(X^{(k)} > x) = e^{-x}$ when $x > 0$. Construct the $Y$ variable by adding
$\mu = r \log p$, where $r > 0$, to just $p^{1-\beta}$ of the components of $X$,
leaving the others unaltered.

If these special $p^{1-\beta}$ components are among those that have an
exponential distribution, then we can write

(2.11) $\max_{1 \le k \le p} X^{(k)} = Q_1 + \log p + o_p(1)$,

(2.12) $\max_{1 \le k \le p} Y^{(k)} = \max\{Q_1 + \log p,\; Q_2 + (1 - \beta + r) \log p\} + o_p(1)$,

where $Q_1$ and $Q_2$ are asymptotically independent and have the extreme-value
distribution function $\exp(-e^{-x})$. In both these expansions, we can
consider $Q_1 + \log p$ to equal the maximum of the $p - 2p^{1-\beta}$ components
of $X$ that have an exponential distribution and which exclude the $p^{1-\beta}$
components to which the perturbation $\mu$ is added to form $Y$, and
$Q_2 + (1 - \beta + r) \log p$ to be the maximum of the $p^{1-\beta}$ components of $Y$
that are obtained by perturbing components of $X$.

It follows from (2.11) and (2.12) that if $r > \beta$, then
$\max_k Y^{(k)} - \max_k X^{(k)} \to \infty$ in probability. More specifically, when
$r > \beta$, the maximum of the components of a new vector $Z$ can be used to obtain
asymptotically correct classification. This result does not hold if $r < \beta$.

On the other hand, if the perturbation $\mu$ is added to each of the $p^{1-\beta}$
components of $X$ that have a normal $N(0, 1)$ distribution, and if none of the
exponentially distributed components of $X$ is perturbed, then

$\max_{1 \le k \le p} Y^{(k)} = \max[Q_1 + \log p,\; \{2(1 - \beta)(\log p)\}^{1/2} + r \log p] + o_p(1)$

and $P(\max_k X^{(k)} = \max_k Y^{(k)}) \to 1$, unless $r \ge 1$. In particular, in this
case, only when $r \ge 1$ is it possible to discriminate between the $X$ and $Y$
populations using extrema or FDR.

By way of contrast, we shall show in Section 3 that, no matter where the
perturbations are added, the nearest-neighbor method produces asymptotically
correct classification whenever $r > 2\beta - 1$. Since $1 > \beta > 2\beta - 1$,
the nearest-neighbor classifier enjoys greater sensitivity than the method
based on extrema or FDR, no matter whether the perturbations are added
to light- or heavy-tailed data components.
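
For readers who wish to experiment with this example, the following sketch generates one realization of the mixed model. It is an illustration under stated assumptions, not the authors' simulation code: the function name, the rounding of $p^{1-\beta}$ to an integer, and the choice to place the normal components in the first block of indices are all made here for convenience.

```python
import numpy as np

def simulate_mixed(p, beta, r, perturb="exponential", rng=None):
    """One draw (x, y, mu) from the mixed light/heavy-tailed example.

    x has round(p**(1-beta)) standard normal components and the rest
    unit exponential; y = x + mu, where mu equals r*log(p) on
    round(p**(1-beta)) components chosen among the exponential
    (perturb="exponential") or the normal (perturb="normal") ones.
    """
    rng = np.random.default_rng(rng)
    n_special = int(round(p ** (1.0 - beta)))
    if 2 * n_special > p:
        raise ValueError("p too small for this beta")
    x = rng.exponential(size=p)
    x[:n_special] = rng.standard_normal(n_special)    # normal block
    mu = np.zeros(p)
    if perturb == "normal":
        mu[:n_special] = r * np.log(p)
    elif perturb == "exponential":
        mu[n_special:2 * n_special] = r * np.log(p)   # exponential components
    else:
        raise ValueError("perturb must be 'normal' or 'exponential'")
    return x, x + mu, mu
```

Comparing `y.max()` with `x.max()` over repeated draws gives an empirical feel for the thresholds on $r$ discussed above, although for moderate $p$ the asymptotic statements should be expected to hold only approximately.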

2.4. Discussion of nearest-neighbor methods. The versatility, performance 
and simplicity of NN classifiers are important factors in their popularity. 
As we show in this paper, NN methods also have significant potential for 
"robustification" and for fine-tuning through thresholding; both of these 
modifications lead to further improvements in performance. Nevertheless, 
well-known caveats about NN techniques should be mentioned. 

Nearest-neighbor algorithms are most clearly suited to problems where 
the major departures among distributions are the results of differences in 
means, rather than differences in variances. To appreciate why NN classifiers 
can face challenges when the differences are principally in terms of variance, 
consider the elementary case where the variables $X^{(k)}$, for $1 \le k \le p$, are
independent and identically distributed with zero mean and variance $\sigma_X^2$;
the $Y^{(k)}$'s are likewise i.i.d., with zero mean and variance $\sigma_Y^2$; and
$\sigma_X^2 < \sigma_Y^2$. If $Z$ comes from population $X$ then, as $p$ increases, the
probability that the inequality

(2.13) $\frac{1}{p} \sum_{k=1}^{p} (X^{(k)} - Z^{(k)})^2 < \frac{1}{p} \sum_{k=1}^{p} (Y^{(k)} - Z^{(k)})^2$

holds tends to 1, since the left-hand side and right-hand side are, respectively,
equal to $2\sigma_X^2 + o_p(1)$ and $\sigma_X^2 + \sigma_Y^2 + o_p(1)$. The probability that
(2.13) holds when $Z$ is from population $Y$ also converges to 1, since in this
setting the two sides of (2.13) are, respectively, $\sigma_X^2 + \sigma_Y^2 + o_p(1)$ and
$2\sigma_Y^2 + o_p(1)$. Therefore, no matter what population $Z$ is from, the simple
NN classifier will, with probability converging to 1 as $p \to \infty$, assign $Z$
to the population with smaller variance, that is, to population $X$. This will
hold true for samples of any sizes $m$ and $n$, provided those quantities are
kept fixed as $p$ diverges.
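
The failure described here is easy to reproduce numerically. The sketch below assumes Gaussian components purely for convenience (the argument above needs only zero means and finite variances) and takes $m = n = 1$; the function name is hypothetical.

```python
import numpy as np

def mean_sq_distances(sigma_x, sigma_y, z_from, p, rng):
    """Both sides of (2.13) for one training vector per population."""
    x = rng.normal(0.0, sigma_x, p)
    y = rng.normal(0.0, sigma_y, p)
    z = rng.normal(0.0, sigma_x if z_from == "X" else sigma_y, p)
    lhs = np.mean((x - z) ** 2)   # approx. 2*sigma_x^2 or sigma_x^2 + sigma_y^2
    rhs = np.mean((y - z) ** 2)   # approx. sigma_x^2 + sigma_y^2 or 2*sigma_y^2
    return lhs, rhs

rng = np.random.default_rng(0)
for source in ("X", "Y"):
    lhs, rhs = mean_sq_distances(1.0, 2.0, source, 200_000, rng)
    # In both cases lhs < rhs, so Z is assigned to population X.
    print(source, round(lhs, 2), round(rhs, 2), lhs < rhs)
```

With $\sigma_X = 1$ and $\sigma_Y = 2$ the two sides concentrate near 2 and 5 when $Z$ is from $X$, and near 5 and 8 when $Z$ is from $Y$, so the inequality holds in either case.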

The result is quite different if the two populations have equal variances 
but unequal means. There, the probability that Z is correctly allocated by a 
NN classifier typically converges to 1 as $p \to \infty$, if there are sufficiently many
sufficiently large differences among means. Although in Section 3 we shall 
permit distributions to take very different forms among components, the 
differences with real leverage for classification will be those among means. 
The process of thresholding, which converts continuous measurements into 
zero-one data, tends to remove problems caused by differences among vari- 
ances, although to some extent it converts differences among means into 
differences among variances; recall that a zero-one variable with mean q has 
variance $q(1 - q)$. However, as we shall show in Section 3, this does not cause
significant difficulty. 

3. Theoretical properties. 

3.1. Summary. The models that we shall use to describe the X and 
Y vectors will differ through perturbations (location changes), added 
to individual components. The models will be constructed so as to admit 
considerable heterogeneity among the distributions, as well as to allow
dependence; see Section 3.6 for discussion of the latter. In Sections 3.2 and
3.3 we shall describe the density, size and scalability of the perturbations
and marginal distributions. Classification boundaries will be discussed in 
Section 3.4. The principles introduced there will dictate the context of the 
main theoretical results given in Sections 3.5 and 3.6. These results will 
reflect difficult classification problems, where configurations are close to op- 
timal classification boundaries. Our main theorems will be stated under the 
assumption that the number of dimensions, p, diverges, while the sample 
sizes, m and n, are held fixed. 

3.2. Relationship between marginal distributions of $X$ and $Y$. For
sequences $b_p$ and $c_p$ depending on $p$, we write $b_p \asymp c_p$ to mean that the
ratio $b_p/c_p$ is bounded above zero and below infinity, as $p$ diverges. Given a
sequence $a_p$ diverging to infinity, and a constant $\beta \in (\frac12, 1)$, we shall
say that

(3.1) the sequence $\mu^{(1)}, \ldots, \mu^{(p)}$ "has asymptotic density $p^{-\beta}$ and is on
the scale $a_p$," if (i) the number, $N_p$ say, of nonzero $\mu^{(k)}$'s satisfies
$N_p \asymp p^{1-\beta}$ and (ii) none of the nonzero $\mu^{(k)}$'s is less than $a_p$.
The perturbations $\mu^{(k)}$ will be added to the respective components of $X$
to create a vector with the distribution of $Y$. Therefore, our model for the
way in which the marginal distributions of $X$ and $Y$ are related will be that

(3.2) for $1 \le k \le p$, $Y^{(k)}$ is distributed as $X^{(k)} + \mu^{(k)}$, where the
sequence $\mu^{(1)}, \ldots, \mu^{(p)}$ has asymptotic density $p^{-\beta}$ and is on the scale
$a_p$, with $\frac12 < \beta < 1$.

Condition (3.2) relates only to the number of $\mu^{(k)}$'s that are different from
zero, not to the order of the nonzero values in the sequence
$\mu^{(1)}, \ldots, \mu^{(p)}$. In particular, the assumption is much less stringent than
it would be if it were supposed that the indices of the nonzero $\mu^{(k)}$'s were
distributed according to a particular random process. The latter constraint is
implicit whenever mixture models are assumed.

In (3.1) and (3.2), we choose $\beta \in (\frac12, 1)$ since classification in the case
$\beta < \frac12$ is relatively easy (indeed, root-$n$ consistent estimation is
generally possible when $\beta < \frac12$), and since nontrivial, asymptotically
correct classification is impossible if $\beta = 1$.

3.3. Scalability. We shall use the phrase, "the marginal distributions of
$X$ are continuous and scalable," to mean that, for each $r \in (0, 1)$, the
equation

(3.3) $\sum_{k=1}^{p} P(X^{(k)} > a_p) = p^{1-r}$

has a unique solution $a_p = a_p(r)$, and that for each $\varepsilon \in (0, r)$ there exists
$C = C(\varepsilon) \in (0, 1)$ such that, for all sufficiently large $p$,

(3.4) $\sum_{k=1}^{p} P(X^{(k)} > C a_p) \le p^{1-r+\varepsilon}$.

In particular, if the $X^{(k)}$'s are identically distributed as $X^{(0)}$, then the
common distribution is scalable if, when $a_p$ is defined by
$P(X^{(0)} > a_p) = p^{-r}$, for each $\varepsilon \in (0, r)$ there exists $C \in (0, 1)$ such
that $P(X^{(0)} > C a_p) \le p^{-r+\varepsilon}$.
Scalable distributions include the normal, and other exponentially
decreasing distributions such as the Subbotin, with probability density
function $f$ given by

(3.5) $f(x) = C_\gamma^{-1} \exp(-|x|^\gamma / \gamma)$,

where $\gamma > 0$ and $C_\gamma = 2\Gamma(1/\gamma)\gamma^{(1/\gamma) - 1}$. See Donoho and Jin [5] for an
account of the interest in, and applications of, the Subbotin distribution.
Scalable distributions also include regularly varying distributions such as the
Pareto, for which

(3.6) $P(X^{(k)} > x) = x^{-\gamma}$,

when $x > 1$, where $\gamma > 0$. Nonscalable distributions have extremely light
upper tails, for example, the extreme-value distribution for which
$P(X > x) = \exp(-e^{x})$.
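
As a worked check of the definition in the identically distributed case, the two extreme examples just mentioned can be verified directly. The calculation below is a sketch written for this purpose and is not taken from the authors' proofs.

```latex
% Pareto tail (3.6): P(X > x) = x^{-\gamma} for x > 1.
% Solving P(X > a_p) = p^{-r} gives a_p = p^{r/\gamma}. For fixed C in (0,1)
% and p large enough that C a_p > 1,
\[
P(X > C a_p) = C^{-\gamma} p^{-r} \le p^{-r+\varepsilon}
\quad\Longleftrightarrow\quad C^{-\gamma} \le p^{\varepsilon},
\]
% which holds for all sufficiently large p, so the Pareto tail is scalable.

% Extreme-value tail: P(X > x) = \exp(-e^{x}).
% Solving P(X > a_p) = p^{-r} gives e^{a_p} = r \log p. Then
\[
P(X > C a_p) = \exp\{-(r \log p)^{C}\},
\qquad
p^{-r+\varepsilon} = \exp\{-(r - \varepsilon) \log p\},
\]
% and since C < 1 we have (r \log p)^{C} < (r - \varepsilon) \log p for all
% large p, so P(X > C a_p) > p^{-r+\varepsilon}: the requirement fails for
% every C in (0,1), and the distribution is not scalable.
```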

Of course, scalability of the marginal distributions of $X$ does not require
the $X^{(k)}$'s to be identically distributed. A particularly simple,
nonidentically distributed example is that where $N_1(p)$ of the components
$X^{(k)}$ are distributed as $X^{(0)}$, say; the other $N_2(p) = p - N_1(p)$
components have distribution functions that dominate that of $X^{(0)}$, in the
sense that $P(X^{(k)} \le t) \ge P(X^{(0)} \le t)$ for all $1 \le k \le p$ and all
$t > t_0$, say; the distribution of $X^{(0)}$ is scalable, in the sense described in
the previous paragraph; and $N_1(p) \sim p$ as $p \to \infty$. This model, and
Theorems 1 and 2 below, permit a rigorous account of performance of the
nearest-neighbor classifier in the context of the examples discussed in
Section 2.3.

3.4. Detection and classification boundaries. In this subsection, we
assume that all the marginal distributions of $X$ are identical to that of
$X^{(0)}$, say, and we take each of the $p^{1-\beta}$ nonzero values of $\mu^{(k)}$ to equal
$a_p$, defined by $P(X^{(0)} > a_p) = p^{-r}$, where $\beta \in (\frac12, 1)$ and
$r \in (0, 1)$. Theorems 1 and 2, below, imply that in this case the robust
nearest-neighbor classifier defined by (2.9) will asymptotically correctly
classify data, provided that

(3.7) $1 - 2\beta + r > 0$.

That is, if (3.7) holds, and even when $m = n = 1$ (i.e., when there is only one
training data value from each population), the probability that the classifier
at (2.9) correctly assigns $Z$, no matter whether it comes from the $X$ or the
$Y$ population, converges to 1 as $p \to \infty$.

Conversely, if $(\beta, r)$ lies strictly below the boundary described by the line

(3.8) $1 - 2\beta + r = 0$,

then the probability of correct classification fails to converge to 1. Moreover, 
the same boundary plays the same role (i.e., as the border that separates 
classifiable and nonclassifiable cases) if we use a truncated standard nearest- 
neighbor method. The latter technique requires the data distributions to 
have several finite moments, whereas the approach suggested in our paper 
is far more robust than conventionally truncated nearest-neighbor methods. 
It is significant that the boundaries are identical in the cases of robust and 
nonrobust nearest-neighbor methods. In particular, the greater robustness 
of our approach does not come at the price of reduced effectiveness. 
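
To make the quantities in this subsection concrete, the sketch below computes the scale $a_p$ for two tails where $P(X^{(0)} > a_p) = p^{-r}$ can be solved in closed form, and reports on which side of the line (3.8) a pair $(\beta, r)$ falls. The function names and the restriction to these two families are assumptions made for illustration.

```python
import math

def scale_a_p(p, r, family="exponential", gamma=1.0):
    """Solve P(X0 > a_p) = p**(-r) for two closed-form tails."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie in (0, 1)")
    if family == "exponential":   # P(X0 > x) = exp(-x) for x > 0
        return r * math.log(p)
    if family == "pareto":        # P(X0 > x) = x**(-gamma), as in (3.6)
        return p ** (r / gamma)
    raise ValueError("family must be 'exponential' or 'pareto'")

def side_of_boundary(beta, r):
    """'above' if (3.7) holds, 'below' if 1 - 2*beta + r < 0, else 'on'."""
    if not (0.5 < beta < 1.0 and 0.0 < r < 1.0):
        raise ValueError("need beta in (1/2, 1) and r in (0, 1)")
    margin = 1.0 - 2.0 * beta + r
    if margin > 0:
        return "above"
    if margin < 0:
        return "below"
    return "on"
```

For instance, `side_of_boundary(0.6, 0.5)` gives `"above"`, the regime in which the text states that correct classification is achieved asymptotically, while `side_of_boundary(0.9, 0.5)` gives `"below"`.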

To define standard truncated nearest-neighbor classifiers, let
$X_{ij}^t = X_{ij} I(X_{ij} > t)$, $Y_{ij}^t = Y_{ij} I(Y_{ij} > t)$ and
$Z_j^t = Z_j I(Z_j > t)$, respectively, where $t$ denotes the truncation point. The
corresponding truncated vectors are $X_i^t = (X_{ij}^t)$, $Y_i^t = (Y_{ij}^t)$ and
$Z^t = (Z_j^t)$. We apply the standard nearest-neighbor classifier to the
truncated datasets $\{X_1^t, \ldots, X_m^t\}$ and $\{Y_1^t, \ldots, Y_n^t\}$,
instead of to the original data. That is, we assign $Z$ to the $X$ population
if $Z^t$ is nearer to at least one of the $X_i^t$'s than it is to any of the
$Y_i^t$'s, and we assign it to the $Y$ population otherwise. Assume that the
random variables $X_{i_1 j_1} - E(X_{i_1 j_1})$ and $Y_{i_2 j_2} - E(Y_{i_2 j_2})$ are all
independent and identically distributed, with the distribution of $U$, say, and
that the scalability condition holds. It can be proved that if $q = p^{1-\beta}$;
if the truncation point $t$ does not exceed $\nu$; if $\nu = a_p$, where $a_p$
satisfies (3.3) [or equivalently, $P(U > a_p) = p^{-r}$]; and if $(\beta, r)$ lies
strictly below the boundary given by (3.8); then
$p E\{U^4 I(U > t)\} / (q\nu^2)^2$ is bounded away from zero. Moreover,
it is shown by Hall, Pittelkow and Ghosh [8] that if, along a subsequence
of values of $p$, $p E\{U^4 I(U > t)\} / (q\nu^2)^2$ does not converge to zero, then
the probability of correct classification fails to converge to 1. Similarly, if
$(\beta, r)$ lies above the boundary, then the probability of correct
classification converges to 1. This establishes the implications of the
boundary in the case of standard truncated nearest-neighbor classifiers, and
its implications for our truncated form are similar.

In some problems, and when m = n = 1, the boundary at (3.8) is identical 
to that for an optimal classifier, implying that the robust nearest-neighbor 
approach has asymptotically optimal performance. However, the classifiers 
for which this boundary is known require the marginal distribution to be 
known; our truncated, thresholded nearest-neighbor approach is not subject 
to that requirement. 

For example, in the Subbotin case represented by (3.5), with $0 < \gamma \le 1$;
and in the Pareto case given by (3.6), when $\gamma > 0$; it is known [5, 14] that the
boundary represented by (3.8) is the optimal boundary for signal detection.
It can be proved from this result that it is also the optimal boundary for
classification, when $m = n = 1$. In the Subbotin case where $\gamma > 1$, alternative
methods, such as Donoho and Jin's [5] higher-criticism method and the
approaches suggested by Ingster [11, 12, 13], give a lower optimal boundary
even when $m = n = 1$ and, hence, permit classification in cases where robust
nearest-neighbor methods do not.

3.5. Case of independent components. The error rates of the classifier at
(2.9) are defined to be the probability that $Z$ is misclassified as coming from
$Y$ when it is really from $X$ and the probability of misclassification of $Z$ as
coming from $X$ when it is actually from $Y$.

Note that, if $t$ is sufficiently small, then it is possible to have
$P(X^{(k)} > t) = P(Y^{(k)} > t) = 1$, uniformly in $1 \le k \le p$ and in $p$. In this
case, the ratio $T(t)/S(t)$ is not well defined. To remove pathologies such as
this, we modify the definition of $\hat t$, at (2.7), by insisting that, for some
fixed $t_0$ sufficiently large, only values $t > t_0$ be considered. In the theorem
below, we hold $m$ and $n$ fixed and let $p$ increase without bound.
Theorem 1. If the components of $X$ are independent, and the components
of $Y$ are independent; if the marginal distributions are related by (3.2),
and are continuous and scalable; if, for $r \in (0, 1)$, the quantity
$a_p = a_p(r)$, defined by (3.3), diverges to infinity but at a rate no faster
than $p^D$ for some $D > 0$, as $p$ increases; if the pair $(\beta, r)$ is above the
classification boundary, in the sense that (3.7) holds; and if $z_p$ is given by
(2.8), where $\xi_p$ diverges more slowly than $p^\varepsilon$ for each $\varepsilon > 0$; then, as
$p \to \infty$ for fixed $m$ and $n$, the error rates of the classifier at (2.9)
converge to zero.

The assumption in Theorem 1 that $a_p = O(p^D)$, for some $D > 0$, is
satisfied if, for example, $\sup_k E(|X^{(k)}|^\varepsilon) < \infty$ for some $\varepsilon > 0$.

3.6. Case of dependent components. As in (3.2), we take the distribu- 
tions of the components of Y to be translations of those of the respective 
components of $X$. In particular, given stochastic processes $U_1, \ldots, U_p$ and
$U'_1, \ldots, U'_p$, each with the same $p$-variate distribution, we define

(3.9) $X^{(k)} = U_k + \nu_k$, $\quad Y^{(k)} = U'_k + \nu_k + \mu^{(k)}$.

The challenge is to model the degree of dependence among marginals and, 
at the same time, to permit the marginal distributions to vary in shape, as 
well as location, from one component to another. This is done through an 
exponentiated moving average process, defined in part (a) of (3.10): 

(3.10) (a) $U_k = \sum_{j \ge 1} \omega_j W_{j+k}^{\alpha_k}$, where the nonnegative random
variables $W_j$ are independent and identically distributed as $W$; (b) for all
$w$, $P(W \le w) < 1$; (c) for some $\varepsilon > 0$, $E(W^\varepsilon) < \infty$; (d) the
distribution of $W$ has a bounded probability density; (e) the constants
$\alpha_k$ are permitted to be functions of $p$ as well as $k$, and for some $C > 1$,
for all $p$ and for all $1 \le k \le p$, $C^{-1} \le \alpha_k \le C$; (f) for some
$C > 0$, for some $\omega \in (0, 1)$ and for all $j \ge 1$, $|\omega_j| \le C\omega^j$; and
(g) at least one $\omega_j$ is strictly positive.

The $\nu_k$'s are taken to be uniformly bounded, and the $\mu^{(k)}$'s to have
properties similar to those at (3.2):

(3.11) (a) $\nu_k$ and $\mu^{(k)}$ are functions of $p$ as well as $k$; (b) for a fixed
constant $C > 0$, $|\nu_k| \le C$ for all $p$ and for all $1 \le k \le p$; and (c) given
$r \in (0, 1)$ and $\beta \in (\frac12, 1)$ and with $a_p$ defined by (3.3), the sequence
$\mu^{(1)}, \ldots, \mu^{(p)}$ has asymptotic density $p^{-\beta}$ and is on the scale $a_p$.

The "continuity" part of the assumption, in Theorem 1, that the marginal
distributions of $X$ are continuous and scalable, is taken care of by (3.10)(d).
However, we also need scalability, as well as a version of that condition in the
case of logarithmically spaced marginals. For the latter, (3.12) is sufficient:
defining $\pi_k(t) = P(X^{(k)} > t)$, we ask that

(3.12) for each $B, \varepsilon > 0$, there exists $t' = t'(B, \varepsilon)$ such that, if $\ell_p$
denotes the integer part of $B \log p$, then
$p^{\varepsilon} \sum_{0 \le k \le (p-h)/\ell_p} \pi_{k\ell_p + h}(t) \ge \sum_{1 \le k \le p} \pi_k(t)$
for all $0 \le h \le \ell_p$ and all $t > t'$.

In assumption (3.10), parts (a) and (e) imply that the $U_k$ process is a
generalized moving average with geometrically decaying coefficients. The
generalization, through raising $W_{j+k}$ to the power $\alpha_k$, allows the
distribution of $U_k$ to be varied substantially from one component to another.
In particular, the tail weights can be very different; smaller $\alpha_k$'s give
distributions with lighter tails.

To interpret parts (b) and (g) of (3.10), note that if $P(U_k \le C) = 1$ for
some $C > 0$ and for all $k$, then the problem of discriminating between $X$
and $Y$, on the basis of location shifts to the right, is relatively simple. Part
(b), which asserts that the upper tail of the distribution of $W$ is unbounded,
together with (g), which asks that at least one contribution
$\omega_j W_{j+k}^{\alpha_k}$ to $U_k$ be positive, permit us to avoid this degeneracy.
Part (c) of (3.10) is a very weak moment assumption and, in particular, permits
the distribution of $W$ to be so heavy tailed that it lies in the domain of
attraction of a stable law.

In (3.11), parts (a) and (b) permit the $\nu_k$'s to vary quite generally, subject
only to being bounded. Condition (3.12) holds true trivially if the marginal
distributions are all identical and can be shown to be valid under other
heterogeneous models.

Theorem 2, below, is a version of Theorem 1 for dependent data. As in
the case of Theorem 1, we modify the definition of the empirical threshold,
at (2.7), by considering only values $t > t_0$, for $t_0$ fixed but sufficiently
large.

Theorem 2. If the joint distributions of the components of X and Y
are given by (3.9), with the quantities there generated as described by (3.10)
and (3.11); if the marginal distributions of $X^{(k)}$ are scalable, and satisfy
(3.12); if, for $r \in (0, 1)$, the quantity $a_p = a_p(r)$, defined by (3.3),
diverges no faster than $p^D$ for some $D > 0$, as $p$ increases; if the pair
$(\beta, r)$ lies above the classification boundary, in the sense that (3.7)
holds; and if $z_p$ is given by (2.5), where ^p diverges more slowly than
$p^{\varepsilon}$ for each $\varepsilon > 0$; then, as $p \to \infty$
for fixed m and n, the error rates of the classifier at (2.9) converge to zero.

4. Numerical properties. 

4.1. Microarray data. As a practical example, we compared the perfor-
mance of the thresholded method with the nearest-neighbor method on the
BRCA dataset [6, 9], which we obtained from
http://www.nejm.org/general/content/supplemental/hedenfalk/index.html.
This dataset contains microarray data from patients with breast cancer, caused
by two different types of mutations, labelled BRCA1 and BRCA2. The expression
level of each of 3226 genes was measured in each patient, and there are 7
patients with BRCA1 and 8 patients with BRCA2.

This dataset (and indeed many gene microarray datasets) is well suited to
our thresholded method. For a start, it has very high dimension and low
sample size. Furthermore, it is expected that only a few genes will
be differentially expressed between the two types of cancer, so the difference
between the populations is sparse. Lastly, the underlying distributions of
the gene expressions are likely to be heavy-tailed and to exhibit significant
dependence among genes. The nearest-neighbor method traditionally performs
poorly in both situations, especially in comparison with the thresholded method.

We tested the two methods on this dataset by calculating the cross-
validation performance: we classify each patient using all the other
patients as training data and calculate the rate of correct classification.
For the nearest-neighbor method, cross-validation correctly classified 11 out
of the 15 patients. Our thresholded method performed considerably better; with
$z_p = 0.5(\ln p)^{1/2}$, all 15 patients were classified correctly under
cross-validation. In fact, this happened when we set the coefficient of
$(\ln p)^{1/2}$ in $z_p$ to be anywhere between 0.35 and 0.5.
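
The leave-one-out procedure described above can be sketched as follows. This is a minimal illustration for the conventional nearest-neighbor rule only, assuming squared Euclidean distance and a single nearest neighbor. The function name is hypothetical, and the random array in the usage lines is a synthetic stand-in with the same shape as the BRCA data, not the data themselves.

```python
import numpy as np

def loo_nearest_neighbor_correct(data, labels):
    """Number of observations classified correctly by leave-one-out 1-NN.

    data:   array of shape (n_patients, n_genes)
    labels: array of shape (n_patients,)
    """
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances between all observations.
    diffs = data[:, None, :] - data[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diffs, diffs)
    # An observation may not serve as its own neighbor.
    np.fill_diagonal(dist, np.inf)
    predicted = labels[np.argmin(dist, axis=1)]
    return int(np.sum(predicted == labels))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_expression = rng.standard_normal((15, 3226))  # synthetic placeholder
    fake_labels = np.array([1] * 7 + [2] * 8)          # 7 of one type, 8 of the other
    print(loo_nearest_neighbor_correct(fake_expression, fake_labels), "of 15 correct")
```

The same loop structure would apply to the thresholded classifier, with the distance comparison replaced by the rule at (2.9).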

4.2. Simulated data. As an additional test, we also compared the thresh-
olded method with the nearest-neighbor method for simulated data. We
compare the two methods in the area of the $\beta$-$r$ plane where classification
is possible ($r > 2\beta - 1$), but not easy (that is, outside the region
$\beta < 1/2$ or $r > 1$). Overall, we found that in cases where standard
nearest-neighbor does not perform well, the thresholded method improves on it.
We look at some of these cases.

4.2.1. Independent heavy-tailed marginal distributions. Nearest-neighbor
methods do not do very well when the marginal distributions of the compo-
nents of X (and Y) are heavy-tailed (i.e., their tails decay to zero more slowly
than those of a normal distribution). We compared the methods for simple models
where $m = n = 1$ and the components of X are independent and have identical
Student's t distributions. By varying the degrees of freedom, we can observe
the behavior of the methods relative to the heaviness of the tails.
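
A Monte Carlo comparison of this kind might be set up as in the sketch below. Several ingredients are assumptions made for illustration and are not taken from the paper: the shift size `shift` and threshold `t` are free parameters rather than the quantities defined at (3.3) and (2.7), roughly $p^{1-\beta}$ components are shifted, and `thresholded_says_y` is a simple stand-in that truncates each component's contribution to 0 or 1; it is not the classifier at (2.9). All function names are hypothetical.

```python
import numpy as np

def simulate_triple(p, beta, shift, df, rng):
    """One training vector per population (m = n = 1) and a test vector from Y."""
    n_shifted = max(1, int(round(p ** (1.0 - beta))))  # sparse set of shifted components
    mu = np.zeros(p)
    mu[rng.choice(p, size=n_shifted, replace=False)] = shift
    x = rng.standard_t(df, size=p)        # training vector from population X
    y = rng.standard_t(df, size=p) + mu   # training vector from population Y
    z = rng.standard_t(df, size=p) + mu   # new observation, truly from Y
    return x, y, z

def nearest_neighbor_says_y(x, y, z):
    # Conventional rule: assign z to the closer training vector.
    return np.sum((z - y) ** 2) < np.sum((z - x) ** 2)

def thresholded_says_y(x, y, z, t):
    # Illustrative truncation: each component contributes 0 or 1.
    return np.sum(np.abs(z - y) > t) < np.sum(np.abs(z - x) > t)

def success_rates(p=2000, beta=0.6, shift=4.0, df=3, t=4.0, n_rep=200, seed=0):
    """Fraction of replications in which each rule correctly assigns z to Y."""
    rng = np.random.default_rng(seed)
    nn_hits = th_hits = 0
    for _ in range(n_rep):
        x, y, z = simulate_triple(p, beta, shift, df, rng)
        nn_hits += int(nearest_neighbor_says_y(x, y, z))
        th_hits += int(thresholded_says_y(x, y, z, t))
    return nn_hits / n_rep, th_hits / n_rep

if __name__ == "__main__":
    for df in (3, 10, 30):  # heavier to lighter tails
        print(df, success_rates(df=df))
```

Varying `df` while holding the other arguments fixed mimics the experiment of changing tail weight; the resulting numbers depend on the assumed parameters and should not be read as reproducing the paper's results.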

For this case, if we are given a threshold $t$, the success rate of the algorithm
can be approximated very accurately, for any $\beta$ and $r$, by looking at the
contribution of each dimension to $T(t)$. By varying $t$ we can calculate the
optimal threshold, which we call the a priori optimal threshold, and also
the best possible performance of the classifier. However, we are not usually
given the threshold, so this is an upper limit on the success rate. Instead,
we compare the classifiers with empirically chosen thresholds, on simulated
data with p up to 20,000.

We found that, for sufficiently heavy tails, the thresholded method dom-
inates standard nearest-neighbor in all areas of the $\beta$-$r$ plane. In fact,
the success rate of the thresholded method actually improves for heavier tails.
As the tails get lighter (the d.f. gets larger), the success rate declines, and
nearest-neighbor does better in a small area in the plane, which grows and
moves around as the tail weight decreases. For small d.f., this area occurs
at high $\beta$ and $r$ [see Figure 1(a)]; for larger d.f., this area occurs at
low $\beta$ and $r$ neither high nor low [see Figure 1(b)].

The thresholded method also dominates nearest-neighbor if we use the
a priori optimal thresholds, for sufficiently heavy tails. If the tails are not
heavy enough, nearest-neighbor works better for low $\beta$ and $r$.

We found that the best performance of the thresholded method is achieved
when we take $z_p$ in (2.7) to be $c(\ln p)^{1/2}$, where $c$ is a constant. The
value of $c$ that maximizes the success rate lies between 0.3 and 0.9, depending
on $\beta$ and $r$. However, the best success rate achieved with an empirically
chosen threshold is worse than that achieved with the a priori threshold, because
the empirical threshold is not constant for constant $z_p$. Figure 2 estimates
the distribution of the chosen threshold for various cases when $z_p$ is close to
optimal. Figure 3 shows how the value of the threshold affects the success
rate, while Figure 4 shows how the value chosen for $z_p$ affects the success rate.
In both of these figures, the curves represent the thresholded method, while
the horizontal lines show the performance of the nearest-neighbor method
for comparison.



Fig. 1. Areas where the two methods perform better, for heavy-tailed distributions. The
nearest-neighbor method performs better in the shaded area; otherwise, the thresholded
method is better. (a) t distributions, d.f. = 4; (b) t distributions, d.f. = 10.

4.2.2. Dependent normal marginal distributions. Another case where stan-
dard nearest-neighbor methods perform badly is when the components of X
are dependent on each other. We compared the methods for varying degrees
and types of dependence; for example, when the components of X are mov-
ing averages of independent standard normal variables, or weighted moving
averages, or an autoregressive process $X^{(k+1)} = aX^{(k)} + (1 - a)N^{(k)}$,
where $N^{(1)}, N^{(2)}, \ldots$ is a sequence of independent standard normal
variables.
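
The dependence structures just described can be generated as in the following sketch. The function names, the equal default weights, and the choice of starting value for the autoregression are assumptions made for illustration; the window length of 5 matches the moving average used for Figure 5.

```python
import numpy as np

def moving_average_components(p, window=5, weights=None, rng=None):
    """p components, each a (weighted) moving average of standard normals."""
    rng = np.random.default_rng(rng)
    if weights is None:
        weights = np.full(window, 1.0 / window)   # unweighted average (assumed default)
    weights = np.asarray(weights, dtype=float)
    noise = rng.standard_normal(p + len(weights) - 1)
    # "valid" keeps only windows that lie fully inside the noise sequence.
    return np.convolve(noise, weights, mode="valid")

def autoregressive_components(p, a, rng=None):
    """Recursion x[k] = a * x[k-1] + (1 - a) * noise[k], mirroring the AR form in the text."""
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal(p)
    x = np.empty(p)
    x[0] = noise[0]                               # starting value is an assumption
    for k in range(1, p):
        x[k] = a * x[k - 1] + (1.0 - a) * noise[k]
    return x

if __name__ == "__main__":
    ma = moving_average_components(20000, window=5, rng=0)
    ar = autoregressive_components(20000, a=0.5, rng=0)
    # Lag-one sample correlation as a rough measure of dependence strength.
    print(np.corrcoef(ma[:-1], ma[1:])[0, 1], np.corrcoef(ar[:-1], ar[1:])[0, 1])
```

Increasing `window` lengthens the range of the dependence, whereas changing `weights` or `a` alters its strength, which allows the two effects to be varied separately.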

Again, we found that for sufficient levels of dependence, the thresholded
method dominates nearest-neighbor for all $(\beta, r)$. For weaker levels of de-
pendence, the nearest-neighbor method works better in a small area at small
$\beta$ and $r$ neither small nor large (see Figure 5), and this region grows with
decreasing dependence. We found that the strength of the dependence [e.g.,
the covariance between components] affects the size of this region more than
the length of the dependence (the number of components of X dependent on a
given component).



Fig. 3. Success rate vs. threshold (as a proportion of shift amount) for
p = 20,000, $(\beta, r) = (0.7, 0.4)$.

Fig. 4. Success rate vs. $c$ for p = 20,000, $(\beta, r) = (0.7, 0.4)$, where
$z_p = c(\ln p)^{1/2}$.

As with the heavy-tailed case, taking $z_p = c(\log p)^{1/2}$ optimizes the success
rate, with $c$ taking values similar to those found before. However, the overall
success rate of the thresholded method is worse than for an equivalent
independent case. The behavior of the chosen threshold, and its effect on the
success rate, is similar to its behavior for heavy-tailed distributions.

4.2.3. Independent normal marginal distributions. For comparison, we
also looked at the case where the components of X were independent and
normally distributed. Here, the thresholded method does not dominate
nearest-neighbor, which works better for low $\beta$ (approximately
$\beta < 0.65$). This is consistent with the behavior for heavy-tailed
distributions as the tails get lighter. The behavior of the chosen threshold,
and its effect on the success rate, is again similar to its behavior for
heavy-tailed distributions. The overall success rate is worse than for
heavy-tailed distributions, but better than that for dependent distributions.

Fig. 5. Areas where the two methods work best, for moving averages of 5 normal random
variables.

4.2.4. Larger samples. The above scenarios all involved $m = n = 1$. As
the sample sizes m and n increase, but are kept equal, the classification
success rate of both methods increases. As m and n increase, the thresholded
method outperforms the nearest-neighbor method for a greater range of the
$\beta$-$r$ plane, although the difference is slight up to $m = n = 10$ (the upper
limit of our testing).

When the sample sizes are not equal, the thresholded method performs
better when m is smaller, if $m + n$ is kept constant. In fact, although in-
creasing m or n while keeping the other fixed generally increases the clas-
sification rate, it is possible to decrease the classification rate by increas-
ing m while keeping n fixed (e.g., when $n = 1$). As the effectiveness of the
nearest-neighbor method stays largely the same, the thresholded method
outperforms the nearest-neighbor method for much larger areas of the
$\beta$-$r$ plane when $m < n$, and is much less effective for $m > n$.